“Web scraping is the process of automatically mining data or collecting information from the World Wide Web.” – Wikipedia
Web scraping is a flexible method to extract data from the internet. It can involve extracting numerical or textual data.
There are many uses for web scraping, including but not limited to:
Good news! You can easily check whether a site permits scraping with the robotstxt package.
Netflix does not allow you to scrape their site.
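As a rough illustration of that check, the sketch below wraps `paths_allowed()` from the robotstxt package, which downloads a site's robots.txt and reports whether a path may be crawled. The helper name `can_scrape` is hypothetical, and the calls need a live network connection, so no output is shown.

```r
library(robotstxt)

# Hypothetical helper (not from the talk): TRUE when the site's robots.txt
# permits the default bot ("*") to fetch the given URL.
can_scrape <- function(url) {
  isTRUE(all(paths_allowed(paths = url)))
}

# Example calls (require network access):
# can_scrape("https://www.netflix.com/")
# can_scrape("https://www.amazon.com/")
```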
Hyper Text Markup Language
“HTML is the standard markup language for creating Web pages.”
Cascading Style Sheets
“CSS describes how HTML elements are to be displayed on screen, paper, or in other media.”
– W3Schools
Image credit: Professor Shawn Santo
HTML is structured with “tags,” indicating portions of a page.
Tags can be selected by their structure.
Tags can be nested.
A few important tags (of many) for scraping:
<h1> header tags </h1>
<p> paragraph elements </p>
<ul> unordered bulleted list </ul>
<ol> ordered list </ol>
<li> individual list item </li>
<div> division </div>
<table> table </table>
Extracting parts of a website can be daunting if unfamiliar with CSS.
SelectorGadget is helpful (Chrome only).
Inspecting the page elements is also helpful (most major browsers).
CSS - syntax is easier and aligns with HTML tags
XPath - useful when the node isn’t uniquely identified with CSS
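To make the two selector styles concrete, here is a minimal sketch that parses a small, made-up HTML snippet (not taken from any real site) and pulls the same list items with a CSS selector and with an XPath expression. The tag structure and the `title` class are assumptions for illustration only.

```r
library(rvest)

# Made-up page: nested tags, with list items sharing a class
page <- read_html('
<div id="books">
  <h1>R Books</h1>
  <ul>
    <li class="title">Advanced R</li>
    <li class="title">The Book of R</li>
  </ul>
  <p>Updated weekly</p>
</div>')

# CSS selector: <li> tags with class "title"
css_titles <- page %>%
  html_nodes("li.title") %>%
  html_text()

# XPath: the same nodes, addressed by their position in the tag structure
xpath_titles <- page %>%
  html_nodes(xpath = "//ul/li[@class='title']") %>%
  html_text()

# XPath can also pick a node by position, which plain class selectors cannot
second_title <- page %>%
  html_nodes(xpath = "//ul/li[2]") %>%
  html_text()
```

Both selectors return the same two titles here; XPath earns its keep when a class or tag alone does not single out the node.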
Set up the environment to scrape the site.
That’s it!
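The setup code itself is not shown here. A minimal sketch, assuming the package set implied by the functions used in the rest of the talk (the original slide may have loaded a different combination):

```r
# Assumed package set, inferred from the functions used later in the talk
library(rvest)      # read_html(), html_nodes(), html_text(), and the %>% pipe
library(stringr)    # str_trim()
library(tibble)     # tibble()
library(robotstxt)  # paths_allowed()
```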
Seems appropriate to pull R book data from Amazon.
We are good to scrape!
Before you get started, you must specify the URL.
amazon <- read_html("https://www.amazon.com/s?k=R&i=stripbooks&rh=n%3A283155%2Cn%3A75%2Cn%3A13983&dc&qid=1592086532&rnid=1000&ref=sr_nr_n_1")
Data as of 2020-07-07
amazon %>%
  html_nodes(".s-line-clamp-2") %>%
  html_text() -> titles
head(titles)
#> [1] "\n \n \n \n\n\n\n\n\n \n \n \n R for Data Science: Import, Tidy, Transform, Visualize, and Model Data\n \n \n \n \n\n\n \n"
#> [2] "\n \n \n \n\n\n\n\n\n \n \n \n The Book of R: A First Course in Programming and Statistics\n \n \n \n \n\n\n \n"
#> [3] "\n \n \n \n\n\n\n\n\n \n \n \n Discovering Statistics Using R\n \n \n \n \n\n\n \n"
#> [4] "\n \n \n \n\n\n\n\n\n \n \n \n R Graphics Cookbook: Practical Recipes for Visualizing Data\n \n \n \n \n\n\n \n"
#> [5] "\n \n \n \n\n\n\n\n\n \n \n \n Advanced R, Second Edition (Chapman & Hall/CRC The R Series)\n \n \n \n \n\n\n \n"
#> [6] "\n \n \n \n\n\n\n\n\n \n \n \n Analyzing Baseball Data with R, Second Edition (Chapman & Hall/CRC The R Series)\n \n \n \n \n\n\n \n"
The element pulls a number of breaks and blank spaces.
Let’s clean this up with str_trim, which removes the \n characters and white space around the titles.
titles <- str_trim(titles) # Removes leading & trailing space
head(titles)
#> [1] "R for Data Science: Import, Tidy, Transform, Visualize, and Model Data"
#> [2] "The Book of R: A First Course in Programming and Statistics"
#> [3] "Discovering Statistics Using R"
#> [4] "R Graphics Cookbook: Practical Recipes for Visualizing Data"
#> [5] "Advanced R, Second Edition (Chapman & Hall/CRC The R Series)"
#> [6] "Analyzing Baseball Data with R, Second Edition (Chapman & Hall/CRC The R Series)"
amazon %>%
  html_nodes("a.a-size-base.a-link-normal.a-text-bold") %>%
  html_text() -> format
head(format)
#> [1] "\n \n \n \n Paperback\n \n \n"
#> [2] "\n \n \n \n Kindle\n \n \n"
#> [3] "\n \n \n \n Paperback\n \n \n"
#> [4] "\n \n \n \n eTextbook\n \n \n"
#> [5] "\n \n \n \n Paperback\n \n \n"
#> [6] "\n \n \n \n Kindle\n \n \n"
The price structure splits price into two elements. We must pull each and combine them into a single price.
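The price code is not shown here, so the following is only a sketch of the combining step. It assumes the two scraped pieces arrive as character vectors, one holding the dollars and one the cents; the sample values and the names `price_whole` and `price_fraction` are hypothetical.

```r
# Hypothetical scraped pieces; the real page markup may differ
price_whole    <- c("40.", "25.", "1,033.")
price_fraction <- c("09", "00", "02")

# Keep digits only, so stray dots, commas, and white space do not matter
whole_digits    <- gsub("[^0-9]", "", price_whole)
fraction_digits <- gsub("[^0-9]", "", price_fraction)

# Glue the two halves back together and convert to a number
price_demo <- as.numeric(paste0(whole_digits, ".", fraction_digits))
```

Stripping non-digits before pasting avoids a doubled decimal point when the whole-dollar text already ends in one.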
This element is messier and we’ll need a number of cleaning steps.
amazon %>%
  html_nodes("div.a-row.a-size-small") %>%
  html_text() -> rate_n
head(rate_n)
#> [1] "\n\n\n\n \n\n\n\n\n\n\n \n \n \n 4.7 out of 5 stars\n \n \n \n\n\n\n\n\n\n\n \n\n\n\n\n\n \n \n \n 427\n \n \n \n \n\n\n\n"
#> [2] "\n\n\n\n \n\n\n\n\n\n\n \n \n \n 4.3 out of 5 stars\n \n \n \n\n\n\n\n\n\n\n \n\n\n\n\n\n \n \n \n 76\n \n \n \n \n\n\n\n"
#> [3] "\n\n\n\n \n\n\n\n\n\n\n \n \n \n 4.5 out of 5 stars\n \n \n \n\n\n\n\n\n\n\n \n\n\n\n\n\n \n \n \n 255\n \n \n \n \n\n\n\n"
#> [4] "\n\n\n\n \n\n\n\n\n\n\n \n \n \n 4.7 out of 5 stars\n \n \n \n\n\n\n\n\n\n\n \n\n\n\n\n\n \n \n \n 14\n \n \n \n \n\n\n\n"
#> [5] "\n\n\n\n \n\n\n\n\n\n\n \n \n \n 4.8 out of 5 stars\n \n \n \n\n\n\n\n\n\n\n \n\n\n\n\n\n \n \n \n 31\n \n \n \n \n\n\n\n"
#> [6] "\n\n\n\n \n\n\n\n\n\n\n \n \n \n 4.4 out of 5 stars\n \n \n \n\n\n\n\n\n\n\n \n\n\n\n\n\n \n \n \n 14\n \n \n \n \n\n\n\n"
rate_n <- str_trim(rate_n)
head(rate_n)
#> [1] "4.7 out of 5 stars\n \n \n \n\n\n\n\n\n\n\n \n\n\n\n\n\n \n \n \n 427"
#> [2] "4.3 out of 5 stars\n \n \n \n\n\n\n\n\n\n\n \n\n\n\n\n\n \n \n \n 76"
#> [3] "4.5 out of 5 stars\n \n \n \n\n\n\n\n\n\n\n \n\n\n\n\n\n \n \n \n 255"
#> [4] "4.7 out of 5 stars\n \n \n \n\n\n\n\n\n\n\n \n\n\n\n\n\n \n \n \n 14"
#> [5] "4.8 out of 5 stars\n \n \n \n\n\n\n\n\n\n\n \n\n\n\n\n\n \n \n \n 31"
#> [6] "4.4 out of 5 stars\n \n \n \n\n\n\n\n\n\n\n \n\n\n\n\n\n \n \n \n 14"
Let’s assemble the file!
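Before the pieces are combined, note that the trimmed rate_n strings above still hold both the star rating and the rating count. The splitting code is not shown here, so this is only a sketch of one way to separate them with stringr. The sample strings are shortened copies of the output above, and the third value, with a thousands separator, is hypothetical.

```r
library(stringr)

# Strings shaped like the trimmed rate_n output (white space shortened)
rate_raw <- c("4.7 out of 5 stars\n \n \n 427",
              "4.3 out of 5 stars\n \n \n 76",
              "4.5 out of 5 stars\n \n \n 1,255")

# Star rating: the number at the very start of the string
rating_demo <- as.numeric(str_extract(rate_raw, "^[0-9.]+"))

# Rating count: the digits at the very end, with any commas removed
count_demo <- as.numeric(str_remove_all(str_extract(rate_raw, "[0-9,]+$"), ","))
```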
length(titles)
#> [1] 16
length(format)
#> [1] 36
length(price)
#> [1] 36
length(rating)
#> [1] 14
length(rate_n)
#> [1] 14
length(pub_dt)
#> [1] 16
Wait! What?!?
A common issue with scraping is that you sometimes get an uneven number of records because some data elements are missing.
We can fix this!
…manually…
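Manual correction is the route this talk takes. As an aside, one alternative (not used here) is to select each search result's container first and then call `html_node()` (singular) inside it, which returns one value per container and `NA` where a field is missing. A minimal sketch on made-up HTML; the `result` and `rating` class names are hypothetical:

```r
library(rvest)

# Made-up results page: the second book has no rating
page <- read_html('
<div class="result"><h2>Book A</h2><span class="rating">4.7</span></div>
<div class="result"><h2>Book B</h2></div>
<div class="result"><h2>Book C</h2><span class="rating">4.3</span></div>')

# One node per search result
results <- html_nodes(page, "div.result")

# html_node() keeps one entry per result, so the vectors stay aligned
demo_titles <- results %>% html_node("h2") %>% html_text()
demo_ratings <- results %>%
  html_node("span.rating") %>%
  html_text() %>%
  as.numeric()
```

Here `demo_ratings` has three entries with `NA` in the second position, so it lines up with `demo_titles` without any manual patching.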
All titles were populated and scraped accurately. However, due to multiple formats, these records must be repeated to fill the dataframe.
Some titles have more than 2 formats.
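The repetition code is not shown here. A minimal sketch of the idea, assuming you have counted by hand how many formats each title has; the titles and counts below are hypothetical:

```r
# Hypothetical titles and hand-counted formats per title
titles_demo       <- c("Book A", "Book B", "Book C")
formats_per_title <- c(2, 3, 2)

# Repeat each title once per format so it lines up with the format vector
titles_long <- rep(titles_demo, times = formats_per_title)
```

Passing a vector to `times` repeats each element its own number of times, which handles the titles with more than 2 formats in the same call.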
Nothing needed here!
Or here!
Two books don’t have ratings.
Like titles, the ratings need to be repeated.
The same corrections are done here.
Books with more than 2 formats.
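A sketch of how the missing ratings might be patched before repeating, assuming you know by inspection which book lacks a rating; the values and the position are hypothetical:

```r
# Hypothetical scraped ratings: three books on the page, but the second
# has no rating, so only two values came back
rating_scraped <- c(4.7, 4.5)
missing_at     <- 2  # position of the unrated book, found by inspection

# Insert NA where the unrated book belongs
rating_full <- append(rating_scraped, NA, after = missing_at - 1)

# Then repeat per format, as with the titles (hypothetical counts)
rating_long <- rep(rating_full, times = c(2, 3, 2))
```

When more than one book is unrated, inserting from the last missing position backwards keeps the earlier positions valid.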
Not all titles have a rating, and those without one won’t have a rating count either.
Like titles, the rating counts need to be repeated.
The same corrections are done here.
Books with more than 2 formats.
Like titles, the publication dates need to be repeated.
Books with more than 2 formats.
r_books <- tibble(title = titles,
                  text_format = format,
                  price = price,
                  rating = rating,
                  num_ratings = rate_n,
                  publication_date = pub_dt)
head(r_books)
#> # A tibble: 6 x 6
#>   title                    text_format price rating num_ratings publication_date
#>   <chr>                    <chr>       <dbl>  <dbl>       <dbl> <date>
#> 1 R for Data Science: Imp~ Paperback    40.1    4.7         427 2017-01-10
#> 2 R for Data Science: Imp~ Kindle       25.0    4.7         427 2017-01-10
#> 3 The Book of R: A First ~ Paperback    33.0    4.3          76 2016-07-16
#> 4 The Book of R: A First ~ eTextbook    30.0    4.3          76 2016-07-16
#> 5 Discovering Statistics ~ Paperback    34.4    4.5         255 2012-04-05
#> 6 Discovering Statistics ~ Kindle       61.6    4.5         255 2012-04-05
Web Scraping in R & rvest on GitHub
This talk is freely distributed under the MIT License.